CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance


Similar resources

CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency b...


A Survey on Linguistic Structures for Application-level Fault-Tolerance

The structures for the expression of fault-tolerance provisions into the application software are the central topic of this paper. Structuring techniques answer the questions "How to incorporate fault-tolerance in the application layer of a computer program" and "How to manage the fault-tolerance code". As such, they provide means to control complexity, the latter being a relevant factor for the...


Exploiting Application-Level Correctness for Low-Cost Fault Tolerance

Traditionally, fault tolerance researchers have required architectural state to be numerically perfect for program execution to be correct. However, in many programs, even if execution is not 100% numerically correct, the program can still appear to execute correctly from the user’s perspective. Hence, whether a fault is unacceptable or benign may depend on the level of abstraction at which cor...


Performance Tradeoffs in Policies for Application Level Fault Tolerance

Object oriented applications and services are composed of a number of objects with instances, which interact to accomplish common goals. Fault tolerance is attained via application transparent replication policies for masking faults that do not recur after recovery. Recently, we realized the advent of a number of middleware infrastructures and services, which allow customizing the replication c...


Application-Level Resilience Modeling for HPC Fault Tolerance

Understanding the application resilience in the presence of faults is critical to address the HPC resilience challenge. Currently we largely rely on random fault injection (RFI) to quantify the application resilience. However, RFI provides little information on how fault tolerance happens, and RFI results are often not deterministic due to its random nature. In this paper, we introduce a new meth...



Journal

Journal title: IEEE Transactions on Parallel and Distributed Systems

Year: 2019

ISSN: 1045-9219,1558-2183,2161-9883

DOI: 10.1109/tpds.2018.2866794